Univariate Data Embedding
- Generate Univariate IoT Data:
  - The function generate_univariate_iot_data() creates a dataset of 11 simulated smart meter readings taken every 30 minutes, covering the 5 hours from 2024-01-01 00:00 to 05:00. Each entry has a unique smart_meter_device_ID, a reading_timestamp, and a random meter_reading drawn uniformly between 980 and 1050.
- Embed Univariate IoT Data:
  - The function embed_univariate_iot_data() uses the SentenceTransformer model paraphrase-MiniLM-L6-v2 to generate one embedding per row. Each row is serialized as a single string containing the device ID, timestamp, and meter reading, and that string is what the model embeds.
- Store Embeddings in FAISS:
  - The embeddings are stored in a FAISS index of type faiss.IndexFlatL2, which performs exact (exhaustive) nearest neighbor search under the L2 distance.
- Query Search:
  - The function retrieve_similar_data() queries the index with a free-text query (such as a device ID or meter reading). The query is embedded with the same model and compared to the stored embeddings to retrieve the top k=3 most similar rows.
- Displaying Results:
  - The display_markdown() function prints the original data and the retrieved rows (called "predicted" data in this notebook) as markdown tables, using the to_markdown() method.
  - A multi-group spider (radar) chart drawn with matplotlib visualizes the predicted data: each axis is one min-max-scaled field (meter reading, embedding distance, device ID, timestamp), and each polygon is one retrieved row.
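As a rough illustration of what a flat L2 index computes at query time, the sketch below runs the same kind of exhaustive comparison in plain Python. The helper names squared_l2 and flat_l2_search are hypothetical, and the toy vectors are made up rather than model output. FAISS does this in optimized code, and faiss.IndexFlatL2 reports squared L2 distances, which is why the sketch skips the square root.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(stored, query, k=3):
    """Compare `query` with every stored vector (no approximation).

    Returns (indices, distances) of the k closest vectors, nearest first.
    Hypothetical helper for illustration; it is not part of FAISS.
    """
    scored = ((squared_l2(vec, query), i) for i, vec in enumerate(stored))
    nearest = heapq.nsmallest(k, scored)
    indices = [i for _, i in nearest]
    distances = [d for d, _ in nearest]
    return indices, distances


# Toy 2-dimensional "embeddings" (made-up values, not model output)
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.5, 0.0]]
indices, distances = flat_l2_search(stored, [0.0, 0.0], k=3)
print(indices, distances)
```

Because every stored vector is visited, the cost of one query grows linearly with the number of rows. That is negligible for the 11 rows used here.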
In [ ]:
%pip install -q faiss-cpu sentence-transformers pandas numpy matplotlib scikit-learn tabulate
Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.3 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 2.1.3 which is incompatible.
ydata-profiling 4.2.0 requires scipy<1.11,>=1.4.1, but you have scipy 1.14.1 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 2.1.3 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
langchain 0.0.217 requires numpy<2,>=1, but you have numpy 2.1.3 which is incompatible.
databricks-feature-store 0.14.3 requires numpy<2,>=1.19.2, but you have numpy 2.1.3 which is incompatible.
In [ ]:
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
# Step 1: Generate univariate IoT time series data (smart meter data)
def generate_univariate_iot_data():
    np.random.seed(0)  # For reproducibility
    start_time = pd.to_datetime('2024-01-01')
    # 11 timestamps at 30-minute intervals, spanning 5 hours (00:00 to 05:00)
    time_index = pd.date_range(start=start_time, periods=11, freq='30min')
    # Generate random meter readings, uniform in [980, 1050)
    meter_readings = np.random.uniform(980, 1050, size=11)
    # Simulate smart meter device IDs, one unique ID per reading (device_1 .. device_11)
    smart_meter_device_IDs = [f"device_{i}" for i in range(1, 12)]
    # Create DataFrame
    data = {
        'smart_meter_device_ID': smart_meter_device_IDs,
        'reading_timestamp': time_index,
        'meter_reading': meter_readings
    }
    return pd.DataFrame(data)
# Step 2: Embed univariate IoT data using SentenceTransformer
def embed_univariate_iot_data(df):
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    # Create a string representation of each row to feed into the model
    texts = df.apply(
        lambda row: f"device: {row['smart_meter_device_ID']} timestamp: {row['reading_timestamp']} reading: {row['meter_reading']}",
        axis=1,
    ).tolist()
    # Create one embedding vector per row
    embeddings = model.encode(texts)
    return embeddings
# Step 3: Store embeddings in FAISS vector database
def store_embeddings_in_faiss(embeddings):
    embeddings = np.array(embeddings).astype('float32')  # FAISS expects float32
    index = faiss.IndexFlatL2(embeddings.shape[1])  # Exact (brute-force) L2 index
    index.add(embeddings)
    return index
# Step 4: Retrieval step of a Retrieval-Augmented Generation (RAG) system: query the embeddings
def retrieve_similar_data(query, index, k=3):
    # Use the same model that produced the stored embeddings
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    query_embedding = model.encode([query]).astype('float32')
    D, I = index.search(query_embedding, k)  # D: squared L2 distances, I: indices of nearest neighbors
    return I[0], D[0]
# Step 5: Display original data and predicted data in markdown format
def display_markdown(df, title="Data"):
    print(f"\n{title}:\n")
    print(df.to_markdown(index=False))  # to_markdown() requires the tabulate package
# Step 6: Visualize predicted data as a multi-group spider (radar) chart
def plot_multi_group_spider_chart(data, title):
    # Convert the "smart_meter_device_ID" to a numerical value (category codes)
    device_ids = data['smart_meter_device_ID'].astype('category').cat.codes.values
    # Convert the "reading_timestamp" to a numerical value (Unix timestamp)
    timestamps = data['reading_timestamp'].astype('int64') / 1e9  # Nanoseconds to seconds
    # Combine all values into a single matrix, in the same order as `categories` below
    full_values = np.column_stack([
        data['meter_reading'].values,
        data['embedding_distance'].values,
        device_ids,
        timestamps,
    ])
    # Min-max scale each column to [0, 1] so all axes share one radial scale
    full_scaled_values = MinMaxScaler().fit_transform(full_values)
    # Setup the radar chart categories
    categories = ['meter_reading', 'embedding_distance', 'smart_meter_device_ID', 'reading_timestamp']
    # Create angles for the radar chart
    num_vars = len(categories)
    angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
    angles += angles[:1]  # Close the loop
    # Create the figure for the plot
    fig, ax = plt.subplots(figsize=(5, 5), dpi=80, subplot_kw=dict(polar=True))
    # Plot each row (predicted data) on the radar chart.
    # enumerate() gives the positional row number, independent of the DataFrame index.
    for i, (_, row) in enumerate(data.iterrows()):
        row_values = np.concatenate([full_scaled_values[i], full_scaled_values[i][:1]])  # Close the loop
        ax.fill(angles, row_values, alpha=0.25)
        ax.plot(angles, row_values, label=f'{row["smart_meter_device_ID"]}', linewidth=2)
    # Set up the labels and ticks
    ax.set_yticklabels([])  # Hide radial labels
    ax.set_xticks(angles[:-1])  # Set the x-ticks to be the categories
    ax.set_xticklabels(categories, fontsize=12)  # Set the labels of each axis
    # Add a legend and title
    ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.1), fontsize=12)
    plt.title(title, size=14)
    plt.show()
# Main execution
if __name__ == "__main__":
    # Generate univariate IoT data
    df = generate_univariate_iot_data()
    # Embed the univariate IoT data
    embeddings = embed_univariate_iot_data(df)
    # Store the embeddings in FAISS
    index = store_embeddings_in_faiss(embeddings)
    # Display the original univariate IoT data
    display_markdown(df, "Univariate IoT Data")
    # Display the first 5 dimensions of the first 5 embeddings of the original data
    embedding_df = pd.DataFrame(embeddings)
    print("\nEmbeddings of the Univariate IoT Data: (first 5 dimensions)")
    print(f" {len(embedding_df.columns)} vector dimensions were created\n")
    print(embedding_df.iloc[:, :5].head(5).to_markdown(index=False))
    # User Query: query by a specific element (device ID, timestamp, or reading)
    user_query = "smart_meter_device_ID : device_7 reading: 1010"
    print(f"\nUser Query: {user_query}\n")
    # Retrieve top 3 best matches based on the query
    indices, distances = retrieve_similar_data(user_query, index)
    # Get the top 3 predicted (retrieved) rows and attach their distances
    predicted_df = df.iloc[indices].reset_index(drop=True)
    predicted_df['embedding_distance'] = distances
    predicted_df = predicted_df.sort_values(by='embedding_distance', ascending=True).reset_index(drop=True)
    # Display predicted univariate IoT data
    display_markdown(predicted_df, "Predicted Univariate IoT Data")
    # Display the first 5 dimensions of the predicted rows' embeddings
    predicted_embeddings = np.array(embeddings)[indices]
    predicted_embedding_df = pd.DataFrame(predicted_embeddings)
    print("\nFirst 5 Embeddings of the Predicted Univariate IoT Data:\n")
    print(predicted_embedding_df.iloc[:, :5].head(5).to_markdown(index=False))
    # Display multi-group spider chart for predicted data
    plot_multi_group_spider_chart(predicted_df, "Predicted Univariate IoT Data")
Univariate IoT Data:

| smart_meter_device_ID   | reading_timestamp   |   meter_reading |
|:------------------------|:--------------------|----------------:|
| device_1                | 2024-01-01 00:00:00 |         1018.42 |
| device_2                | 2024-01-01 00:30:00 |         1030.06 |
| device_3                | 2024-01-01 01:00:00 |         1022.19 |
| device_4                | 2024-01-01 01:30:00 |         1018.14 |
| device_5                | 2024-01-01 02:00:00 |         1009.66 |
| device_6                | 2024-01-01 02:30:00 |         1025.21 |
| device_7                | 2024-01-01 03:00:00 |         1010.63 |
| device_8                | 2024-01-01 03:30:00 |         1042.42 |
| device_9                | 2024-01-01 04:00:00 |         1047.46 |
| device_10               | 2024-01-01 04:30:00 |         1006.84 |
| device_11               | 2024-01-01 05:00:00 |         1035.42 |

Embeddings of the Univariate IoT Data: (first 5 dimensions)
 384 vector dimensions were created

|         0 |          1 |        2 |         3 |         4 |
|----------:|-----------:|---------:|----------:|----------:|
| -0.365002 | -0.0616462 | 0.136118 |  -0.23923 | -0.274961 |
| -0.297714 | -0.0963958 | 0.108235 | -0.247195 | -0.207936 |
| -0.329856 |  -0.096536 | 0.110829 | -0.283669 | -0.217568 |
| -0.235997 | -0.0913811 | 0.148862 | -0.315279 | -0.333508 |
| -0.242079 |  -0.196711 |  0.16983 | -0.240875 | -0.133976 |

User Query: smart_meter_device_ID : device_7 reading: 1010

Predicted Univariate IoT Data:

| smart_meter_device_ID   | reading_timestamp   |   meter_reading |   embedding_distance |
|:------------------------|:--------------------|----------------:|---------------------:|
| device_7                | 2024-01-01 03:00:00 |         1010.63 |              26.3377 |
| device_1                | 2024-01-01 00:00:00 |         1018.42 |              26.4215 |
| device_4                | 2024-01-01 01:30:00 |         1018.14 |              27.0364 |

First 5 Embeddings of the Predicted Univariate IoT Data:

|         0 |          1 |        2 |         3 |         4 |
|----------:|-----------:|---------:|----------:|----------:|
| -0.287904 |  -0.130074 | 0.127292 | -0.229564 | -0.270398 |
| -0.365002 | -0.0616462 | 0.136118 |  -0.23923 | -0.274961 |
| -0.235997 | -0.0913811 | 0.148862 | -0.315279 | -0.333508 |
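The query string in this run labels its fields differently (`smart_meter_device_ID :`) from the template used to serialize the stored rows (`device: ... timestamp: ... reading: ...`). Whether that mismatch affects the ranking is not tested in this notebook. One option worth trying is to build queries with the same labels as the stored rows. The sketch below shows such a helper. The name row_to_text is hypothetical and not part of the code above, and it only formats text, so no model is loaded.

```python
def row_to_text(device_id, timestamp=None, reading=None):
    """Serialize fields with the same labels as embed_univariate_iot_data().

    Hypothetical helper: fields left as None are omitted, so the same
    function can build a full row string or a partial query string.
    """
    parts = [f"device: {device_id}"]
    if timestamp is not None:
        parts.append(f"timestamp: {timestamp}")
    if reading is not None:
        parts.append(f"reading: {reading}")
    return " ".join(parts)


# A full row, formatted like the strings that were embedded and stored
full_row = row_to_text("device_7", "2024-01-01 03:00:00", 1010.63)
# A partial query that reuses the same labels
partial_query = row_to_text("device_7", reading=1010)
print(full_row)
print(partial_query)
```

A string built this way could be passed to retrieve_similar_data() in place of user_query. Comparing the returned distances for both phrasings would show whether the labels matter for this model.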